(57) Abstract

A system and method for generating structured data
outputs from a semi-structured data source. The steps of
this method include generating an example output from an
example generator (14). The example output is generated in
response to the acquisition of a sequence of annotated
strings (12). The annotated strings are generated in
response to the acquisition and modification of at least
one data example and corresponding coarse structure from a
predetermined input source (10). Also, a second sequence of
annotated strings is generated from input from a
semi-structured data source (16). Both the example output
and the second sequence of annotated strings are input to
an acquisition engine (18) that implements a grammar layer
incorporating a top-down parsing method and a comparison
layer. The structured data outputs are generated through
the cooperation of the comparison layer and the grammar
layer (20).
METHOD AND SYSTEM FOR GENERATING STRUCTURED DATA FROM
SEMI-STRUCTURED DATA SOURCES

TECHNICAL FIELD OF THE INVENTION 

The present invention relates generally to data
acquisition and structuring systems and methods, and more
particularly, to a system and method for generating structured
data outputs from semi-structured data inputs.

BACKGROUND OF THE INVENTION 

The general field of this invention relates to 
generating structured data outputs from semi-structured 
data inputs. A particular application of the invention is
acquiring and structuring data to form virtual internet 
databases. Virtual internet databases are databases whose 
content is owned, stored and managed on servers distributed 
across a computer network. 

Recently, internet usage and access have increased
markedly. The availability and quantity of information on
the internet have also increased. Many software products
that can produce printed reports can now produce WEB
reports. These products produce reports that may be
displayed on a WEB page. This is accomplished by embedding 
the text of the report within the computer language called 
HTML. Although posted reports and information appear as 
data on the WEB page, this HTML representation is not a 
data representation. Rather, the WEB browser serves as a 
vehicle to display information much like that of a page in 
a textbook. This presents the problem of incompatibility 
between the HTML representation and the PC desktop and 
server applications. Ultimately, the current practice of 
employing WEB browsers has reduced PCs back to "dumb"
terminals. The graphics may be exciting, but functionally 
all the computing power is limited to providing users with 
little more than a sophisticated data viewing window. 

Several methods have been developed to address the 
problem of moving semi-structured data from the internet to
a PC or server application. These methods include ad hoc 
engineering methods, Graphical User Interface (GUI) 
methods, and machine learning methods. 

Ad hoc methods entail writing specialized parsing 
programs in a language such as PERL or LEX to extract the 
necessary information. These types of programs are called 
wrappers. A wrapper is a software method that converts data 
such as HTML code into structured data for further 
processing. These types of programs use regular
expressions in the parsing process. Unfortunately,
these ad hoc methods are labor intensive. Depending on the
skill of the programmer and the complexity of the
particular job, these methods can take days to develop.
Also, these methods are not an option for an average
internet user with no formal training or knowledge of HTML
and programming methods.

Due to the tedious nature of custom wrapper design, 
further methods have been developed that employ GUIs to 
facilitate the wrapper generation. The GUI hides all the 
engineering details beyond the extracted data pattern 
definitions. Like the ad hoc methods discussed above, 
these packages implement regular expression parsing 
algorithms. In general, these methods require some knowledge
of both HTML and regular expressions; therefore, they may
not be suitable for some internet users.
Due to the use of regular expressions, both ad hoc
methods and GUI methods can result in what are called
brittle parses. Brittle parses result when changes in
format of the HTML page cause the parse to fail. A single 
format change is not guaranteed to break the parse, but the 
likelihood is sufficiently high as to prevent any 
guarantees of robust behavior. 

Recently, machine learning methods have been developed 
to address the need for engineering skills in the 
development of wrappers. Given a set of similar WEB pages 
and an example of the data to be parsed from each page, 
these methods automatically generate a wrapper. 
Unfortunately, these methods require a large number of 
examples to reliably produce wrappers. An example of such 
a method can be found in A Hierarchical Approach to Wrapper 
Induction, Muslea, et al. (1999). This method may require 
8-10 examples to produce the wrappers. The generated 
wrappers are based on regular expression techniques and are 
brittle. Although these wrappers may work for format
changes known prior to wrapper generation, they may fail on
format changes encountered in practice, as do the regular
expression based methods discussed above.

Ideally, it is desirable to develop a method for a
user to gain access to semi-structured data for a PC or
server application without requiring the user to have
previous knowledge of HTML or regular expressions. In
addition, it is advantageous if the method does not require
the enumeration of examples covering possible format
changes.
SUMMARY OF THE INVENTION

The present invention provides a system and method for 
acquiring and structuring data from semi-structured data
sources that substantially eliminates or reduces 
disadvantages and problems associated with previously 
developed systems and methods used for developing 
structured data sources from on-line sources such as the 
Internet, intranets, or other network systems. 

More specifically, the present invention provides a
system and method for generating structured data outputs from
semi-structured data sources. The steps of this method include
generating an example output from an example generator. The
example output is generated in response to the acquisition
of a sequence of annotated strings. The annotated strings
are generated in response to the acquisition and
modification of as little as one data example and a
corresponding coarse structure from a predetermined input
source. Also, a second sequence of annotated strings is
generated from input from a semi-structured data source.
Both the example output and second sequence of annotated
strings are input to an acquisition engine that implements
a grammar layer incorporating a top-down parsing method and
a comparison layer. The structured data outputs are
generated through the cooperation of the comparison layer
and the grammar layer.

The present invention provides an important technical 
advantage in that it does not require the user to have 
knowledge of HTML or knowledge of pattern matching 
languages. The graphical interface guides the user through 
a set-up phase and completely hides all technical details. 
The present invention provides another important technical
advantage in that it requires only a single data example.
Once this set-up process is complete, the acquisition
engine can be pointed to related WEB pages, as well as
updated versions of the same page, and it will automatically
extract data and route it to applications.

The present invention provides yet another technical
advantage in that the system is able to cope with
format changes in the source pages, including changes in
the order of data values. Thus, the technology produces
reliable results even when the data sources are
reformatted, updated or amended by the content providers.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present 
invention and the advantages thereof, reference is now made 
to the following description taken in conjunction with the 
accompanying drawings in which like reference numerals 
indicate like features and wherein: 

FIGURE 1 is a flow diagram of one embodiment of the 
present invention; 

FIGURE 2 is a block diagram of the gross architectural 
breakdown 10 of an embodiment of the present invention; 

FIGURE 3 is a flow diagram for the generation of HTML 
phonemes of the embodiment of FIGURE 1; 

FIGURE 4 illustrates the decomposition of HTML strings
into tokens and phonemes; 

FIGURE 5 is an example of the GUI used to extract 
example data items and the corresponding structure; and 
FIGURE 6 represents an example of the pattern
dictionary including patterns of phonemes and the 
corresponding terminals of the context free grammar. 

DETAILED DESCRIPTION OF THE INVENTION

Preferred embodiments of the present invention are 
illustrated in the FIGURES, like numerals being used to 
refer to like and corresponding parts of the various 
drawings.

The present invention provides a system and method for
generating structured data outputs from semi-structured
data inputs. Details of embodiments of the present
invention are discussed below.

FIGURE 1 is the flow diagram of one embodiment of the
present invention. At step 10, at least one data example
and coarse structure is acquired from a predetermined input
specified by an external source. The at least one data
example can be exactly one data example. This
predetermined input serves to present an example of the
type of data to be acquired and structured. Such a
predetermined input can be PDF files, semi-structured text
files, or HTML files. The external source can be a storage
means such as a database server or a WEB server. At step
12, the data example and coarse structure are modified to
produce a first set of annotated strings. These annotated
strings serve as data structures providing one or more
attributes regarding each data example and the coarse
structure.

At step 14, an example output is generated in an
example generator from the first set of annotated strings.
The example output comprises a pattern dictionary
containing at least one annotated string associated with a
terminal of a context-free grammar. At step 16, a second
set of annotated strings is generated from at least one
semi-structured data source. The semi-structured data
sources serve as the source of data which is to be acquired
and structured. A source for such semi-structured data
sources can be a database server or WEB server. At step 18,
the example output and the second set of annotated strings
are acquired by the acquisition engine. In turn, at step
20, the acquisition engine generates the structured data
outputs from cooperation of a grammar layer and a
comparison layer contained within the acquisition engine.
The grammar layer and comparison layer work in cooperation
to locate in the second set of annotated strings the
desired data outputs based on the example output from the
example generator.

FIGURE 2 is a block diagram of the gross architectural
breakdown 20 of one embodiment of the present invention.
The gross architectural breakdown 20 can be divided into
two major parts: the training stage 26 and the acquisition
stage 28. The internet 24 provides both the training stage
26 and the acquisition stage 28 with an input WEB training
page 32 to be used to extract an example and train the
system as to the type of information and format of
information desired. The internet 24 also provides the
acquisition stage 28 with the semi-structured data sources,
incoming HTML pages 34, to search for and structure the
type of data specified by the training stage 26. These two
stages, the training stage 26 and the acquisition stage 28,
can be further broken down. The training stage 26
comprises a GUI 36, a preprocessor 38, and a builder 46
containing an example generator 48. The GUI 36 is used to
extract information from the input WEB training page 32
located on the internet 24. The preprocessor 38 then
interfaces with the GUI 36 to produce HTML phonemes 40
representing the extracted information from the input WEB
training page 32.

The HTML phonemes 40 are input to the example
generator 48. Example generator 48 converts the HTML
phonemes 40 into a series of patterns which populate a
pattern dictionary 50 and generates a context-free grammar
52. Patterns in the pattern dictionary 50 may include the
user input with phonemes 40 on each side and a
corresponding weight for each phoneme. There can be
multiple patterns in the pattern dictionary 50. The
pattern dictionary 50 and the context-free grammar 52 are
then input into the acquisition stage 28, specifically the
acquisition engine 54. HTML phonemes 44 generated from the
incoming HTML page 34 through the use of a preprocessor 42
are also input into the acquisition engine 54. The
acquisition engine 54 can be broken down further into a
grammar layer 56 and a comparison layer 58. The pattern
dictionary 50 and the context-free grammar 52 are used to
extract the structured data outputs 30 contained within the
HTML phonemes 44. These structured data outputs 30 are
outputs of the acquisition engine 54.

FIGURE 3 is a flow diagram for the generation of HTML
phonemes 40, 44. A pure HTML representation 62 of the
incoming HTML information from the GUI 36 or the incoming
HTML page 34 is created at step 60. The incoming HTML
information may contain scripts and/or callbacks to the
web server, so-called active components. At step 60, these
active components of the incoming HTML information are
converted to a pure HTML representation 62 of the HTML
information. In turn, lexical analysis is performed at
step 64 by breaking the pure HTML representation 62 into
substrings called tokens 66. The tokens 66 are then
adorned with characteristic features at step 68, which
outputs the HTML phonemes 40, 44. These characteristic
features include, but are not limited to, markups that
change font size, markups that add hyperlinks, string
types, row and column numbers of HTML table cells
associated with strings, and row and column numbers of
table cells with respect to the presentation of the
semi-structured data within the incoming HTML information
from the GUI 36 or the incoming HTML page 34.

FIGURE 4 illustrates an example of decomposing an 
incoming HTML string 70 from the incoming HTML page 34 into 
a token list 72. The HTML phonemes chart 74 depicts each
token 66 in the incoming HTML string 70 with its
corresponding characteristic features. Each token 66
together with its characteristic features is called an
HTML phoneme 44.
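
The decomposition of FIGURE 4 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the preprocessor of the embodiment: the `Phoneme` fields, the `PhonemeBuilder` class and the `to_phonemes` helper are hypothetical names, only three kinds of characteristic features are tracked (string type, hyperlink markup, table cell position), and nested tables are not handled.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Phoneme:
    """A text token adorned with features (field names are hypothetical)."""
    text: str
    kind: str      # string type: "number" or "word"
    in_link: bool  # True if the token sits inside a hyperlink markup
    row: int       # table row of the enclosing cell, -1 outside a cell
    col: int       # table column of the enclosing cell, -1 outside a cell


class PhonemeBuilder(HTMLParser):
    """Lexical analysis plus feature adornment in a single pass."""

    def __init__(self):
        super().__init__()
        self.phonemes = []
        self._link_depth = 0
        self._row = -1
        self._col = -1
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag == "tr":
            self._row += 1
            self._col = -1
        elif tag in ("td", "th"):
            self._col += 1
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Each whitespace-separated substring becomes one token.
        for word in data.split():
            is_number = re.fullmatch(r"[-+$]?\d[\d,.]*%?", word)
            kind = "number" if is_number else "word"
            row, col = (self._row, self._col) if self._in_cell else (-1, -1)
            self.phonemes.append(
                Phoneme(word, kind, self._link_depth > 0, row, col))


def to_phonemes(html):
    """Return the list of phonemes for an HTML string."""
    builder = PhonemeBuilder()
    builder.feed(html)
    builder.close()
    return builder.phonemes
```

For an input such as `<table><tr><td><a href='x'>IBM</a></td><td>104.50</td></tr></table>`, this sketch yields two phonemes: a linked word in cell (0, 0) and a number in cell (0, 1).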

FIGURE 5 is a representation of the GUI 36 used to
extract information for the generation of the context-free
grammar 52 and the pattern dictionary 50. The GUI 36
provides the example generator 48 with a coarse structure
of the structured data outputs 30 to be acquired. There
are multiple coarse structures that will determine the
acquisition of the structured data outputs 30. These
coarse structures include: one data record; multiple data
records in a row-major form, not necessarily in an HTML
table; multiple data records in a column-major form, not
necessarily in an HTML table; and nested combinations of
the above three structures, including object-like
structures. The GUI 36 provides the example generator 48
with HTML phoneme representations of each example data
value and phonemes to distinguish the coarse structure.
Overall, the net input to the example generator 48 is a
mapping of text in the input WEB training page 32 to data
values and a structured record.
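
As a hedged illustration of how a coarse structure might translate into grammar rules, the following Python sketch builds a small set of context-free productions for the three basic structures. The function name `build_grammar`, the structure labels, and the nonterminal names `Start`, `Record` and `Col0`, `Col1`, ... are assumptions made for this example only; the source does not specify the form of the generated grammar, and the additional patterns that distinguish the coarse structure are omitted here.

```python
def build_grammar(structure, n_fields):
    """Return productions as {nonterminal: [alternative, ...]}.

    Each alternative is a list of symbols. Terminals are named T0, T1, ...
    one per data value of the example record (hypothetical naming).
    """
    terminals = [f"T{i}" for i in range(n_fields)]
    if structure == "single":
        # One data record: the values appear once, in order.
        return {"Start": [terminals]}
    if structure == "row_major":
        # One or more records, each listing all of its values together.
        return {
            "Start": [["Record", "Start"], ["Record"]],
            "Record": [terminals],
        }
    if structure == "column_major":
        # All values of field 0, then all values of field 1, and so on.
        rules = {"Start": [[f"Col{i}" for i in range(n_fields)]]}
        for i, terminal in enumerate(terminals):
            rules[f"Col{i}"] = [[terminal, f"Col{i}"], [terminal]]
        return rules
    raise ValueError(f"unknown coarse structure: {structure}")
```

Nested combinations could be expressed in the same representation by letting an alternative refer to the `Start` symbol of another grammar, although this sketch does not attempt it.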

FIGURE 6 is an example of the pattern dictionary 50
generated from the example generator 48. Each pattern $P_j$
consists of a sequence of HTML phonemes 40,
$p_0, p_1, \ldots, p_n$, and a set of corresponding weights
$w_0, w_1, \ldots, w_n$. A terminal $T_j$
for the context-free grammar is assigned in one-to-one
correspondence with each pattern in the pattern dictionary.
The context-free grammar represents the coarse structure
and number of data values to be extracted from the
semi-structured data source 34. Once the context-free grammar 52
and the pattern dictionary 50 have been generated in the
training stage 26, they are passed to the acquisition
engine 54. An example of such an engine can be found in
Modification of Earley's Algorithm for Speech Recognition,
NATO ASI Series, Vol. F46, Paeseler, Annedore (1988), which
is incorporated by reference herein in its entirety.
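
A pattern dictionary of this kind might be represented as sketched below. This is a minimal illustration, not the structure of the embodiment: the names `Pattern`, `build_pattern` and `build_dictionary` are hypothetical, phonemes are reduced to plain strings for brevity, and the choice of weighting the example value more heavily than its surrounding context is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """One dictionary entry: phonemes p_0..p_n with weights w_0..w_n."""
    terminal: str    # grammar terminal paired one-to-one with this pattern
    phonemes: tuple
    weights: tuple


def build_pattern(terminal, stream, start, end, context=1,
                  value_weight=1.0, context_weight=0.5):
    """Take the example value stream[start:end] plus `context` phonemes
    on each side; the weight values are assumptions for illustration."""
    lo = max(0, start - context)
    hi = min(len(stream), end + context)
    phonemes = tuple(stream[lo:hi])
    weights = tuple(
        value_weight if start <= i < end else context_weight
        for i in range(lo, hi))
    return Pattern(terminal, phonemes, weights)


def build_dictionary(stream, examples, context=1):
    """Assign terminal T0, T1, ... to each example span (start, end)."""
    dictionary = {}
    for n, (start, end) in enumerate(examples):
        terminal = f"T{n}"
        dictionary[terminal] = build_pattern(
            terminal, stream, start, end, context)
    return dictionary
```

With the stream `["Last", "price", "104.50", "Volume", "12,000"]` and example spans `(2, 3)` and `(4, 5)`, the sketch produces two patterns, `T0` and `T1`, each carrying the selected value and its neighbouring phonemes.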

The comparison of patterns from the pattern dictionary
50 with an input stream of HTML phonemes 44 from the
incoming HTML page 34 occurs in the comparison layer 58.
In the comparison layer 58, a matching score between the
pattern in the pattern dictionary 50 and a pattern found in
the input stream is calculated. This matching score can be
calculated using a weighted edit distance algorithm
incorporating top-down methods with pruning or dynamic
programming. Examples of such weighted edit distance
algorithms can be found in Pairwise Sequence Alignment,
Giegerich, Robert, and Wheeler, David (last modified May,
1996),
<http://www.techfak.uni-bielefeld.de/bcd/curric/PrwAli/prwali.html>,
which is incorporated by reference herein in its
entirety. This algorithm incorporates a normalized
weighted sum of scores between phonemes from the pattern in
the pattern dictionary 50 and phonemes in the input stream
of HTML phonemes 44. Recall that patterns in the pattern
dictionary 50 may have different phonemes and each phoneme
has a corresponding weight. Once the matching score is
generated, the matching score and the matching pattern from
the input HTML stream are supplied to the grammar layer 56.
The grammar layer 56 implements a top-down parsing method
based on a set of grammar rules from the context-free
grammar 52 to determine new patterns which can follow the
previously found matching pattern. These new patterns are
supplied to the comparison layer 58 to complete patterns at
the grammar level from the pattern dictionary 50 with which
the input stream of the HTML phonemes is to be compared.
The process alternates between the grammar layer 56 and the
comparison layer 58 until the last of the HTML phonemes 44
from the incoming HTML page 34 are compared. The
structured data outputs 30 are output based on the sequence
of patterns that has the best cumulative matching score and
corresponds to a correct parse of the document defined by
the context-free grammar 52.
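
The interplay of the two layers can be sketched in Python as follows. This is a deliberately simplified illustration and not the acquisition engine of the embodiment: `match_score` is one possible dynamic-programming form of a normalized weighted edit distance, while `extract` replaces the top-down parser with a greedy loop over a flat rule (a single record), so it does not search for the best cumulative score. The names `similarity`, `match_score`, `extract` and the `insert_cost` parameter are assumptions, and phonemes are modelled as equal-length feature tuples.

```python
def similarity(p, q):
    """Fraction of feature slots on which two phonemes agree."""
    return sum(a == b for a, b in zip(p, q)) / max(len(p), len(q))


def match_score(pattern, weights, window, insert_cost=0.5):
    """Normalized weighted edit score in [0, 1]; 1.0 is a perfect match."""
    n, m = len(pattern), len(window)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # drop pattern phonemes
        dist[i][0] = dist[i - 1][0] + weights[i - 1]
    for j in range(1, m + 1):           # skip input phonemes
        dist[0][j] = dist[0][j - 1] + insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = weights[i - 1]
            mismatch = 1.0 - similarity(pattern[i - 1], window[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + w,
                dist[i][j - 1] + insert_cost,
                dist[i - 1][j - 1] + w * mismatch)
    worst = sum(weights) + insert_cost * m
    return 1.0 - dist[n][m] / worst if worst else 1.0


def extract(stream, dictionary, rule):
    """Greedy alternation between the layers for a flat rule.

    dictionary maps terminal -> (pattern, weights); rule is the ordered
    list of terminals. Returns {terminal: (start, end, score)}.
    """
    pos, found = 0, {}
    for terminal in rule:               # grammar layer: next expected terminal
        pattern, weights = dictionary[terminal]
        size, best = len(pattern), None
        for start in range(pos, len(stream) - size + 1):  # comparison layer
            score = match_score(pattern, weights, stream[start:start + size])
            if best is None or score > best[2]:
                best = (start, start + size, score)
        if best is None:
            break
        found[terminal] = best
        pos = best[1]                   # continue after the matched span
    return found
```

Because a mismatch on a single feature only lowers the score of one phoneme, a window whose formatting has changed (for example, a label that is no longer bold) can still obtain the highest score among the candidate windows in this sketch.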

The present invention has many advantages. First, the
use of a GUI 36 to extract the training information from
the input WEB training page 32 hides all the technical
details behind the builder 46 and the acquisition engine
54. This enables the use of the present invention by users
with little or no previous knowledge of HTML and parsing
methods.

In addition, the present invention requires a minimum
of one data example from the input WEB training page 32 in 
the training stage 26 to acquire the desired structured 
data outputs 30. This eliminates time-consuming processes 
of presenting multiple examples in order to acquire and 
structure the desired data outputs 30. 

An important advantage is that the present invention
is able to cope with format changes from the
semi-structured data sources. Changes such as font size, font
color, or permutations in the data value will not cause the 
acquisition engine to fail. The characteristic features 
which adorn the tokens 66 to create the phonemes 40, 44 
reflect properties including but not limited to format. 
Even if the page has undergone formatting changes, the 
original data value will still have some best match. Due 
to the cumulative characteristics of a pattern, the 
weighted edit distance almost always finds the correct 
match. 

It is important to note that regular grammars are a 
subset of context-free grammars. Therefore, the present 
invention will work properly for regular grammars, as well. 

In summary, the present invention provides a Method
and System for Generating Structured Data from
Semi-structured Data Sources. The steps of this method include
generating an example output from an example generator. The
example output is generated in response to the acquisition
of a sequence of annotated strings. The annotated strings
are generated in response to the acquisition and
modification of at least one data example and corresponding
coarse structure from a predetermined input source. Also,
a second sequence of annotated strings is generated from
input from a semi-structured data source. Both the example
output and second sequence of annotated strings are input
to an acquisition engine that implements a grammar layer
incorporating a top-down parsing method and a comparison
layer. The structured data outputs are generated through
the cooperation of the comparison layer and the grammar
layer. The present invention is robust to formatting
changes and permutations in the semi-structured data
sources. In addition, the present invention is easy to use,
requiring no prior knowledge of parsing languages or HTML.

Although the present invention has been described in
detail, it should be understood that various changes,
substitutions and alterations can be made hereto without
departing from the spirit and scope of the invention as
described by the appended claims.



WO 00/63783
PCT/US00/07792



WHAT IS CLAIMED IS:

1. A method for generating structured data outputs
from semi-structured data sources, said method comprising:

generating an example output from an example generator 
in response to an acquisition of a first plurality of 
annotated strings, said first plurality of annotated 
strings generated from an acquisition and modification of 
at least one data example and a corresponding coarse 
structure from a predetermined input specified by an 
external source; 

generating a second plurality of annotated strings 
relating to an input from said semi-structured data
sources;

acquiring said example output, and said second 
plurality of annotated strings in an acquisition engine, 
said acquisition engine comprising a grammar layer and a 
comparison layer; and 

generating structured data outputs from a cooperation 
of said grammar layer and said comparison layer, said 
grammar layer comprising a top-down parsing algorithm. 

2. The method of Claim 1, wherein said at least one 
data example is one data example. 

3. The method of Claim 1, wherein said example
output is a context-free grammar and a pattern dictionary. 

4. The method of Claim 3, wherein said context-free 
grammar further comprises regular grammar. 
5. The method of Claim 1, wherein said acquisition
and modification of said at least one data example and said 
corresponding said coarse structure further comprises 
separating said at least one data example and corresponding 
coarse structure utilizing lexical analysis to form a first 
set of tokens and annotating said first set of tokens with 
characteristic features to produce said first plurality of
annotated strings.

6. The method of Claim 5, wherein said predetermined
input is a first HTML page and said first set of tokens is 
a first set of HTML phonemes. 

7. The method of Claim 1, wherein said
predetermined input is a PDF file or a semi-structured text
file. 

8. The method of Claim 1, wherein the step of 
generating said second plurality of annotated strings 
further comprises preprocessing said input from said
semi-structured data sources to form said second plurality of
annotated strings.

9. The method of Claim 8, wherein the step of 
preprocessing further comprises separating said input from 
said semi-structured data sources utilizing lexical
analysis to form a second set of tokens and annotating said
second set of tokens with characteristic features to
produce said second plurality of annotated strings.
10. The method of Claim 9, wherein said semi-structured data
sources are one or more related HTML pages and said second 
set of tokens is a second set of HTML phonemes. 

11. The method of Claim 1, wherein said acquisition 
of said at least one data example and said coarse structure 
from a predetermined input further comprises a user 
interface for identification of said at least one data 
example and said coarse structure. 

12. The method of Claim 11, wherein said user 
interface is a Graphical User Interface (GUI) and said 
predetermined input is an HTML page. 

13. The method of Claim 3, wherein the step of 
generating said pattern dictionary further comprises 
assigning a pattern consisting of a portion of said 
sequences of annotated strings to each of said at least one 
data example and assigning additional patterns to 
distinguish said corresponding coarse structure from said 
predetermined input.

14. The method of Claim 13, wherein the step of 
generating said context-free grammar further comprises 
generating terminals that are in one-to-one correspondence 
with said patterns in said pattern dictionary. 

15. The method of Claim 1, wherein said cooperation 
of said grammar layer and said comparison layer further 
comprising: 
sequentially comparing in said comparison layer said
patterns in said pattern dictionary against the said second 
sequence of annotated strings to find a matching pattern in 
a portion of said second sequence of annotated strings; 

compiling a matching score representing a quality of a 
match between said patterns in said pattern dictionary and 
said matching pattern; 

passing said matching score and said matching pattern
to said grammar layer;

extending already found matching patterns with said 
matching pattern to form a sequence of matching patterns; 
and 

executing a set of grammar rules defined by said 
context-free grammar on said sequence of matching patterns 
to locate a legal sequence of strings defined by said set 
of grammar rules and representing said structured data
outputs.

16. The method of Claim 15, wherein compiling the
matching score further comprises implementing a weighted 
edit distance algorithm to calculate the matching score. 

17. The method of Claim 16, wherein the weighted edit 
distance algorithm is a top down method with pruning. 

18. The method of Claim 16, wherein the weighted edit 
distance algorithm is a dynamic programming method. 

19. A system comprising a computer program stored in 
a computer readable form on a tangible storage medium for 
generating structured data outputs from semi-structured
data sources, the computer program executable to:

generate an example output from an example generator
in response to an acquisition of a first plurality of
annotated strings, said first plurality of annotated
strings generated from an acquisition and modification of
at least one data example and a corresponding coarse
structure from a predetermined input specified by an
external source;

generate a second plurality of annotated strings
relating to an input from said semi-structured data
sources;

acquire said example output, and said second plurality
of annotated strings in an acquisition engine, said
acquisition engine comprising a grammar layer and a
comparison layer; and

generate structured data outputs from a cooperation of
said grammar layer and said comparison layer, said grammar
layer comprising a top-down parsing algorithm.

20. The system of Claim 19, wherein said at least one 
data example is one data example. 

21. The system of Claim 19, wherein said example
output is a context-free grammar and a pattern dictionary.

22. The system of Claim 21, wherein said context-free 
grammar further comprises regular grammar. 

23. The system of Claim 19, further executable to
separate said at least one data example and corresponding
coarse structure utilizing lexical analysis to form a first
set of tokens and annotate said first set of tokens with
characteristic features to produce said first plurality of
annotated strings.

24. The system of Claim 23, wherein said 
predetermined input is a first HTML page and said first set 
of tokens is a first set of HTML phonemes. 

25. The system of Claim 19, wherein said 
predetermined input is a PDF file or a semi-structured text
file. 

26. The system of Claim 19, wherein to generate said 
second plurality of annotated strings is further executable 
to preprocess said input from said semi-structured data
sources to form said second plurality of annotated strings. 

27. The system of Claim 26, wherein to preprocess is 
further executable to separate said input from said
semi-structured data sources utilizing lexical analysis to form
a second set of tokens and annotate said second set of
tokens with characteristic features to produce said second
plurality of annotated strings.

28. The system of Claim 27, wherein said
semi-structured data sources are one or more related HTML pages
and said second set of tokens is a second set of HTML
phonemes.
29. The system of Claim 19, wherein said acquisition
of said at least one data example and said coarse structure 
from a predetermined input further comprises a user 
interface for identification of said at least one data 
example and said coarse structure. 

30. The system of Claim 29, wherein said user 
interface is a Graphical User Interface (GUI) and said 
predetermined input is an HTML page. 



31. The system of Claim 21, wherein to generate said
pattern dictionary is further executable to assign a
pattern consisting of a portion of said sequences of
annotated strings to each of said at least one data example
and assign additional patterns to distinguish said
corresponding coarse structure from said predetermined
input.

32. The system of Claim 31, wherein to generate said
context-free grammar is further executable to generate
terminals that are in one-to-one correspondence with said
patterns in said pattern dictionary.



33. The system of Claim 19, wherein said cooperation
of said grammar layer and said comparison layer is further
executable to:

sequentially compare in said comparison layer said
patterns in said pattern dictionary against the said second
sequence of annotated strings to find a matching pattern in
a portion of said second sequence of annotated strings;
compile a matching score representing a quality of a
match between said patterns in said pattern dictionary and 
said matching pattern; 

pass said matching score and said matching pattern to
said grammar layer;

extend already found matching patterns with said 
matching pattern to form a sequence of matching patterns; 
and 

execute a set of grammar rules defined by said 
context-free grammar on said sequence of matching patterns 
to locate a legal sequence of strings defined by said set 
of grammar rules and representing said structured data
outputs.

34. The system of Claim 33, wherein to compile the
matching score is further executable to implement a weighted
edit distance algorithm to calculate the matching score. 

35. The system of Claim 34, wherein the weighted edit 
distance algorithm is a top down method with pruning. 

36. The system of Claim 34, wherein the weighted edit 
distance algorithm is a dynamic programming method. 